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8.1 Exploiting Independence Property 

8.1.1 Introduction 

Our goal is to model biological processes given biological data. It is essential for such a model 
to be of probabilistic nature, since measurements are noisy and molecular biology itself has 
stochastic characteristics. In this lecture Bayesian networks [4, 8] will be introduced ^, these 
provide graphical representation of joint probability distributions in a compact way. 

These networks are useful for describing systems composed of locally interacting compo- 
nents, that is, the value of each component depends on the values of a small number of other 
components. These networks also provide us a model of causal influence, as will be discussed 
later. One of the advantages of using Bayesian networks is the fact that statistical founda- 
tions for learning Bayesian networks from observations, and computational algorithms to do 
so are well understood and have been used successfully in many applications. These appli- 
cations include medical and fault diagnosis expert systems, monitoring systems, information 
access technologies, speech recognition systems, and finally analysis and classification for 
biological sequencing. 

8.1.2 Basic Probability Rules 
• Product Rule: 



Where P{A) is the probability of A, P(A, B) is the joint probability of both A and B, 
and P{A \ B) denotes the probability of A, given B. 

• Independence: Independence between A and B will be denoted as I{A]B). If such 
an independence exists, then: 




(8.1) 




(8.2) 



(8.3) 
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Conditional Independence: If A and B are independent given known C: 

P{A,B \C) = P{A\C) ■ P{B \C) (8.4) 

Or alternatively: 

P{A \B,C) = P{A I C), P{B \A,C) = P{B \ C) (8.5) 
Such an independence will be denoted as I {A; B \ C). 

Note that I{A]B) =^ I{A]B \ C): if A and B are independent, then they will be 
independent given any known C. Nevertheless, I {A; B \ C) ^ I {A; B): if A and B are 
independent given known C, it does not mean A and B are independent in any case. 

Total probability Theorem: Let . . . , be a set of disjoint events, which their 
union is the sample space: B^ — Q, \/i ^ j.Bi f] Bj = (p. 

n n 

P{A) = ^(^.^^) = E ^(^^) • I (8.6) 

i=l i=l 

The last formula is for the case 5 is a discrete variable, in case S is a continuous 
variable, the J (integral) function will be used to sum: 



P{A) = / P{A,B) dB = / P{B) ■ P{A I B) dB 

• Bayes Rule: 

_ P{B 
- P{B) 

Chain Rule: 

P{X-i, . . . , X„) = P{Xx I X2, . . . , Xn) ■ P{X2 I X3, . . . , Xn) ■ . . . • P{Xn^i \ X^) • P{Xn) 

(8.8) 



PiA I B) = ^(-g I f • I 5, C) = (8.7) 
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8.1.3 Exploiting Independence Property 

Independence of variables can be used to simplify significantly the representation of complex 
joint distributions. Consider the representation of the joint distribution of n variables, for 
example n tosses of a coin. Even in this simple binary example, explicit representation of 
the joint distribution will require 2"- entries, for all possible assignment of heads / tails to 
the tosses. 
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Table 8.1: Joint distribution representation of four fair tosses 



Example 8.1: Presenting Distribution of four Fair Tosses 

Let Xi = 1 denote heads in toss number i. The joint distribution of four tosses is represented 
in table 8.1. 

This is obviously a wasteful representation. If the independence of each toss is exploited, 
a much more compact representation can be achieved. Such a representation would include 
only the probabilities of each toss (table 8.2). In this degenerate case the tosses are of a fair 
coin, so the probability to get heads in each toss is 0.5. 
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Table 8.2: Independent distribution representation of four fair tosses 

The joint distribution P{xi,X2,X3,X4) would be computed using this table as 
P{Xi = Xi) ■ P{X2 = X2) ■ -P(^3 = x^) ■ P(X4 = ^4). In most biological applications, the 
majority of the variables are not completely independent, but exploiting the existing inde- 
pendencies can be used to achieve a rather simple representation of complex dependencies. 
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First a simple example will be presented to explain the principles of this method. 

Example 8.2: Two Variables Conditional Probability Distribution (CPD) 

Consider the problem of determining whether a suspect is guilty. The suspect is standing 
trial, and the judge finds him either guilty or innocent. The probability space is the joint 
distribution over these two variables: whether the suspect is guilty, G, and whether the 
judge will find him guilty, J. Assuming theses two variables are binary (0 - not guilty; 1 
- guilty), the joint distribution has four entries. Denote the case that G = by g'^, the 
probability the suspect is not guilty, and in the same manner we use g^, j'^, j^- A possible 
joint distribution of G and J is described in table 8.3. 
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Table 8.3: Joint distribution of two variables (the suspect-judge problem) 

This is the joint distribution representation which presents all the different assignments 
of valid values to the variables: The probability that the suspect is innocent and the judge 
finds him innocent is 0.38, the probability that he is innocent, and yet found guilty by the 
judge is 0.02, the probability that the suspect is guilty but found innocent is 0.06, etc. . . 

The same joint distribution can be alternatively displayed in a somewhat more natural 
manner. Instead of specifying the various joint entries, the product rule (equation 8.1) is 
used to get P{J,G) = P{J \ G) ■ P{G), so P{G) and P{J \ G) are to be represented. This 
will require the use of two tables, one representing the distribution of G, and the other 
representing the conditional probability distribution ( CPD) of J given G. This representation 
is denoted as a factorial representation and is presented in table 8.4. 

The distribution P{J \ G) represents the probability for the judge to find the suspect 
guilty in the two possible alternatives: In the case of g^ - suspect not guilty and in case of 
g^ - suspect is guilty. It is easy to see in this representation the false negative probability: 
I g^) - the judge found the suspect innocent although he was guilty (10% in this 
example), and the false positive probability: P{j^ \ g^) - the probability of an innocent 
suspect to be found guilty (5%). 

It's also easy to switch between the tables using the product rule(equation 8.1), for exam- 
ple in the joint distribution representation appears P{g^j^) = 0.38 (suspect found innocent 
rightfully). This can be achieved from the factorial representation using the product rule 
P( J, G) = P{J I G) ■ P{G), P{g^f) = P{f I g^) ■ P{g^) = 0.95 • 0.4 = 0.38 
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Table 8.4: Factorial representation of two variable distribution (suspect-judge problem) 



Switching from joint distribution representation to factorial is done simply by summing the 
relevant probabilities: P{g^) = P{g^f) + = 0.38 + 0.02 = 0.4 

It is instructive to consider the number of independent parameters needed in each of the 
representations. The joint distribution representations requires three independent parame- 
ters, as there are four entries, and their sum is known to be 1, so the fourth completes to 1. 
The factorial representation consists of three probability distribution: 

1. P{G),soP{g') + P{g^) = l. 

2. P{J\g^),soP{f\g') + P{3^\g')^l. 

3. P{J\g^),soP{j^\g^) + PU'\g') = l. 

Therefore, only three parameters are required. Note that in this example the two represen- 
tations use the same number of parameters, but as will be seen further, the factorial method 
enables a more compact representation. 

Bayesian networks will be formally defined further on, in Section 8.2, yet it will be useful 
to consider how a Bayesian network of this example will be represented. A Bayesian network 
in this case will have two nodes, representing the two variables G and J, and a single edge 
from G to J, representing J depends on G (the direction of dependence in this model is from 
J to G). Figure 8.1 shows the Bayesian network in this simple case: 

Example 8.3: Three Variables Conditional Probability Distribution (CPD) 

Now a more complicated setting will be considered, where the suspect is being tested on a 
polygraph during the investigation. As polygraph results are not adequately reliable, they 
are not admissible in court, and therefore the judge does not see the results. This means his 

verdict is independent of the polygraph's results. Now there are three variable to consider: 
G, J and the outcome of the polygraph, L. Assuming the polygraph results are also binary, 
the joint distribution has eight entries. Table 8.5 presents a possible joint distribution of G, 
J and L. 
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Figure 8.1: Bayesian network for the suspect-judge example. Each variable in the network is associated with a local 
probability distribution 
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Table 8.5: Joint distribution of three variables (the suspect-judge-polygraph problem) 



The result of the polygraph and the judge's verdict clearly depends on whether the 
suspect is guilty or not. The polygraph outcome and the verdict are also dependent: if it is 
given that one of them found the suspect guilty, it increases the probability that the other 
will also find him guilty. 

However, there is a conditional independence: If it is known whether the suspect is 
guilty, the outcome of the polygraph does no longer give information about the verdict, and 
vice versa, assume it is known the suspect is not guilty, the fact that the judge found him 
guilty, for example, does not increase the probability the polygraph will show the same. 
Denoted by the polygraph outcome which states that the suspect is guilty and P otherwise. 
Then, P{1^ \ g^,j^) = P{1^ \ g^). More generally it can be assumed that I{J]L \ G), this 
means if G is given, L and J are independent. Note that this assumption is based on the 
fact that the judge and the polygraph take into considerations completely different aspects 
of the case. 
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The conditional independence allows to provide a compact representation of the joint 
distribution, as seen before. Based on the product rule (8.1), the joint distribution is: 

P{G, J, L) = P( J, L\G)- P{G) 

The conditional probability assumed implies that: 

P( J, L\G)^P{J \ G)- P{L I G) 

Hence, 

P(G, J, L) = P{J I G) ■ P{L I G) ■ P{G) (8.9) 

Equation 8.9 leads to the desired factorial representation. Therefore, in order to specify fully 
the joint distribution of this case, we have to know P{G), P{J \ G) and P{L \ G). Note 
that P{G) and P{J \ G) distributions stays the same as in example 8.2 (table 8.4). This 
means that expanding the previous example to include also L required only the distribution 
of P{L I G) to be added, this distribution is displayed in table 8.6. That is in contrast to 
the eight entries joint distribution representation, in which the whole distribution has to be 
recalculated. 
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Table 8.6: The conditional P{L \ G) distribution which together with table 8.4 , matches the joint distribution described in 
table 8.5. 

Together these conditional probability distributions fully specify the joint distribution. For 
example: P{g^,f,l^) = P{g^) ■ P{f \ ■ P{1^ \ g^) = 0.4 • 0.95 • 0.2 = 0.076. Unlike the 
previous example, this one demonstrates a situation in which the factorial representation 
is more compact: only five independent parameters suffice to fully specify the whole distri- 
bution with conditional dependencies (for example P{g^), P{j^ \ g^), P{j^ \ g^), P{i^ \ g^), 
P{lP I 51^), since the other only completes to 1), whilst seven parameters are required us- 
ing the joint representation (the eighth parameter completes to 1). Note that we succeed 
to reduce the number of parameters due to conditional independence between the variables. 
Moreover, the factorial representation has another important property, which was mentioned 
before, modularity. When adding a new variable to a model, L in this example, only the 
local probability model for L needs to be added, while the local probabilities of G and J 
are reused. In joint distribution, this is not the case, and all the probabilities have to be 
recalculated, as the distribution changes entirely. 

The Bayesian network corresponding to this scenario (figure 8.2) contains an additional 
node in its network, representing L, and an edge from G to L, denoting the direction of 
dependence: L depends on G. Note that given the value of the parent G, both L and J are 
independent. 
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Figure 8.2; Bayesian network for the suspect-judge-polygraph example 

8.2 Bayesian Networks 

8.2.1 Representing Distributions with Bayesian Networks 

Suppose we are given a set of assertions and a variety of ways in which they support each 
other. Each assertion estabhshes a value for an attribute and is of the form (Xj = Xj), that 
is, "Variable Xi has value Xi" . The variables are Xi, . . . , X„. We would know everything we 
need to know about the world described by these assertions if we had the joint probability 
P{Xi, . . . , Xn). From this probability function we could compute any other probability such 
as P{X2) or P{X2 \ X3,X5). Unfortunately, even when assuming for simplicity that the 
variables are binomial, the representation complexity of P{Xi, . . . , X„) is, as we've seen, 2", 
which is impractical even for small value of n. 

Bayesian Networks simplify this problem by taking advantage of existing causal connec- 
tions between assertions, and of assumptions about conditional independence. A Bayesian 
network is a representation of a joint probability distribution. This representation consists 
of two components: 

• The first component, G, is a directed acyclic graph (DAG) whose nodes correspond to 
the random variables Xi, . . . ,X„, and its edges correspond to dependencies and their 
directions. This component is know as the Qualitative part. 

• The second component, describes the local probability model, the conditional probability 
distribution (CPD) for each variable, given its parents in G. Let Xi be a variable and 
Pa(Xj) its parents in G, the CPD of Xj is the distribution of P(Xj | Pa(Xj)). This 
part is the Quantitative part 

Together, these two components specify a unique distribution over Xi, . . . ,X„. The graph 
G represents conditional independence assumptions that allow the joint distribution to be 
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decomposed, economizing on the number of parameters. Note that the graph G encodes 
the Markov Assumption: Each variable Xi is independent of its non- descendants, given its 
parents in G (it is still dependent on its descendants). It is a natural assumption for many 
causal processes. If the parents of an event, meaning the events which directly caused it, are 
known, it is independent of the events which effect its parents, or of any other event, except 
the events which it is the cause for, or the cause for one of their ancestors. 

By applying the chain rule of probabilities and properties of conditional independencies, 
any joint distribution can be decomposed to the product form according to the Markov 
assumption on it's Bayesian network. Suppose w.l.o.g (without loss of generahty) Xi, . . . , Xn 
are arranged in reversed topological order, i.e. if Xj is a descendant of Xi in the network 
then j < i. According to the chain rule (equation 8.8): 

P{Xi, . . . , Xn) = P{Xi I X2, X3, . . . , Xn) ■ P{X2 I X3, X4, . . . , Xn) "... •P(X„_i | X^) ■P(X„) 

Since each Xi does not depend on any of its non-descendants given its parents, and any 
variable Xj that is a descendant of Xi has a lower index (j < i) due to the reversed topological 
order, then P{Xi \ Xj+i, . . . ,X„) = P{Xi \ Pa(Xi)), where Pa(Xi) is the set of parents of 
Xi in graph G. Thus, we get the chain rule for Bayesian networks: 

n 

P(Xi,...,X„) = J]P(X, I Pa(X,)), (8.10) 
1=1 

Example 8.4: A Simple Bayesian Network with Five Variables 

In Figure 8.3 we can see a simple example of a Bayesian network structure. This network 
describes the connections between the following events: 

m B - There is a Burglary. 

• A - The Alarm goes off. 

m E - There is an Earthquake. 

• R - There is a Radio report of an earthquake. 

• G - Mr. Watson, the neighbor. Calls to inform us he heard the alarm. 

The alarm is set to detect earthquakes and attempts of burglary and. Naturally, there is 
a probability of a false alarm to occur, or for some malfunction to cause the alarm not 
to operate when it should. Mr. Watson is the neighbor, and in case he hears the alarm 
he should call us to inform the alarm had gone off. He might, of course, not hear it or 
mistakenly think he had heard it (though it hadn't gone off). In addition the local radio 
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Fig lire 8.3; Simple Bayesian Network with Five Variables 

station usually reports any case of an earthquake. 

Specifying the joint distribution of these events, requires 2^ — 1 = 31 parameters. This is quite 
a high number for such a simple system. Consider an expert system for monitoring intensive 
care patients, which measures over 37 different properties. Joint distribution representation 
of such a system will require at least 2^^ parameters. This is not feasible. Instead, we will 
use a Bayesian network to represent such systems (figure 8.3). 
Following is the list of independencies of this network: 

• I{E,B) - Given the parents of E (none), E is conditionally independent of its non- 
descendant: B. 

• I{B,{E,R)) - Given the parents of B (none), B is conditionally independent of its 
non-descendants: E and R. 

• I{R, {B, A, C) \ E) - Given the parents of R (E), R is conditionally independent of its 
non-descendants: B, A and C. 

• I [A, R \ {B,E)) - Given the parents of A {B,E), A is conditionally independent of its 
non-descendant: R. 

• I{C, {R, B, E) \ A) - Given the parents of C (A), C is conditionally independent of its 
non-descendants: R, B and E. 

Consider the first independence, I{E,B): E has no parents, so in any case it is inde- 
pendent of B. Indeed there is no dependence between the events of a burglary and an 
earthquake. The last independence, I{C, {R, B, E) \ A) means that if the parents of C, 
which is A, is known then C is independent of R, E and B. This correlates with the rational 
of the events: if it is known that the alarm broke off (A), then whether the neighbor will 
or will not call (C) because of it, does not depend on the radio report (R), the earthquake 
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(E) or the burglary (B). This is true despite the fact that the earthquake and the burglary 
might be the reason for the alarm. 

According to the chain rule (equation 8.8), the joint distribution is: 

P{C, A, R, E, B) = P{C I A, R, E, B) ■ P{A \ R, E, B) ■ P{R \E,B)- P{E \ B) ■ P{B) 

Which requires 31 parameters. Alternatively, using independencies in the Bayesian network 
(equation 8.10), the joint distribution of the five events is 

P(C, A, R, E, B) = P{C I A) ■ P{A \B,E)- P{R \ E) ■ P{E) ■ P{B) 

This distribution requires only 10 independent parameters. 

Complexity Analysis 

To fully specify a joint distribution, we need to specify the conditional probabilities in the 
product form. The quantitative part of the Bayesian network describes these conditional 
distributions, P{Xi \ Pa(Xj)) for each variable Xj. Suppose the variables have k possible 
values (If the variables were binomial then k = 2). Each one of these conditional distributions 
contains k^^^^^^^^ independent parameters. Let / be the maximum number of possible parents, 
then for each variable Xj, there are < A;' parameters to be stored. Hence for n variables, 
the representation complexity is 0{n ■ k^). Note that for a joint distributions with many 
variables, each of which has only few dependencies, then / <^ n and the representation 
complexity is much better then the 0{k^) representation complexity in the joint distributions 
representation. 

8.2.2 Representing Markov chains and HMM as Bayesian net- 
works 

A Markov chain can be treated as a simple case of a Bayesian network. A Markov chain is a 




Fi gUre 8.4; A Markov chain represented as a Bayesian network 

triplet A, where Q is a set of states, p is the probabilities of the initial state, and A is 
the state transition probabilities, which contains for each two states s,t & Q the probability 
ttst = P{Xi = t I = s). X = { ) is a random process (for additional details refer 

to HMM scribe 5). An equivalent Bayesian network will be a chain of variables Xi, . . . , X„ 
(figure 8.4), thus is the parent of Xj. Each variable may attain any value s E Q. The 
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CPD of Xi will be p. The CPD of variable Xj is P{Xi \ Xj_i) = agt, i.e. the probability 
of observing value s in variable Xi is dependent on the value in it's parent Xj_i in the 
Bayesian network. Therefore computing -P(X) = P{xi) ■ Y[i=2 ^^i-i^i using Markov chain, is 
equivalent to the factorial representation of the joint distribution in the Bayesian network: 

P(X) = P(Xi = Xi) • nf=2 = ^^ I ^-1 = 

Hidden Markov model (HMM), can also be represented using Bayesian networks. In 
addition to the Markov chain, it contains emission probabilities, which will be represented 
as additional nodes {Yi) in the Bayesian network (figure 8.5). The emission node will have no 
descendants, as nothing depends on them. Each variable may attain any value i/i E T,. The 
CPD of Yi will be represented as the emission probability, i.e. P{Yi = yi \ Xi = Xi) = CxXUi)- 
The CPD of each variable Xj remains the same as in the Markov chain model. 




i i ; i 




Figure 8.5: A HMM as a Bayesian network 

As demonstrates, it's possible to represent HMM using Bayesian network, but this rep- 
resentation is wasteful, since aki and ek{b) appear identically n time, once in each node. In 
the CpG problem, for example, locating CpG islands in a sequence of 2,000 nucleotides, will 
require an HMM with two states, but the Bayesian network of the same problem will contain 
2,000 nodes representing the probabilities of the states in each one of the nucleotides, and 
another 2,000 for the emissions probabilities. Therefore for random processes with mem- 
ory of length 1, it will be easier to use HMM rather then Bayesian networks. However, in 
HMM each state depends only on the state before it, and more complicated dependencies 
cannot be expressed. As the Bayesian network model generalizes, and represents high-order 
dependencies, it should be used in cases of complicated dependencies. 



8.3 Inference in Bayesian networks 
8.3.1 Introduction to Inference 

So far it has been shown that Bayesian networks can be used to represent probability distri- 
butions in a compact and intuitive manner. As it is so, Bayesian networks should contain 
information that would answer any query about the distributions represented by them. This 
section is devoted to the way such queries can be answered, using the data available by the 
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network. Given a Bayesian network for the variables Xi, . . . ,Xn, frequent types of queries 
we are interested in are P{xi), P{xi \ Xj = Xj), P{xi, . . . ,Xn \ Xj = xj) etc. 

Bayesian networks can be used to answer these queries, since the joint distributions 
can be generated using the chain rule for Bayesian networks (equation 8.10). The joint 
distribution can then be used to make the necessary calculations for the query. A naive 
solution to compute P{Xi) based on the total probability theorem (equation 8.6), is 
P{Xi) = Y,x^ ■ ■ ■ ■ T.x,+i ■ ■ ■ • • • ' ^n) Using the joint distribution represen- 

tation, for binomial variables, the complexity of solving this is exponential (0(2")). For 
variables with up to k possible value, it is 0{k"'). Avoiding exponential complexity, was 
one of the main reasons to use Bayesian networks, and in the following section it will be 
demonstrated how the use of them might reduce the complexity. The inference problem in 
Bayesian networks is NP-hard. It means that in the worst case the best algorithm available 
to solve these problems is exponential (assuming P ^ NP). In other words - performing the 
computations on all the joint distribution entries is the best that can be done in worst case. 
But it turns out that the worst case comes up only very rarely. So, although the complexity 
of Bayesian networks algorithms in worst case is exponential, in practice, a simple method 
can be used to make the computations in most practical cases quite fast. This technique is 
called Variable elimination algorithm. 

8.3.2 Variable Elimination algorithm 

The Basic Idea 

The idea of eliminating variables will be demonstrated through a basic inference task on a 
simple Bayesian network. Consider the "chain" network in figure 8.6. Suppose we want to 




Figure 8.6; Simple Bayesian network in form of a nodes chain 

calculate P{A = a,D = d), the distribution of the joint probability of A and D: Using the 
total probability equation (8.6), we get: 

P(a,rf) = 5^5^P(a,6,c,rf) 

b c 

Note that each addition is used to eliminate one variable: Summing for all the probabilities 
of c on the joint distribution, P{a,b,c,d), will yield P{a,b,d), thus eliminating C. In the 
same manner summing up all the probabilities of b on P{a, b, d) will eliminate b, yielding the 
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requested probability P{a,d). Using the factorial representation, instead of the joint one, 
will give us: 

P(a, d) = J2Yl ^' ^) = XlZl I ^)^^^ I ^)^^^ I ")^^") 

be be 

This representation embodies the power of the Bayesian network, as it uses the conditional 
probabilities instead of the joint, and these will be used to simplify the additions and reduce 
the complexity. 

First, c will be eliminated. Note that when summing for all C = c on the conditional 
probabilities, some terms need not to be summed, as their CPDs do not contain the variable 
C. That means we can omit these terms from the sum on c, thus reducing the complexity, 
as less calculation has to be performed to accomplish that sum: 

No c in CPDs 

^ 

P(a, d) = ^^P{d\ c)P{c I b) P{b \ a)P{a) = ^P{h\ a)P{a) ^ P{d \ c)P{c \ b) 

be be 

Now, in order to complete the elimination of c, we will use the chain rule (equation 8.8), 
applying ^^P{d \ c)P{c \ b) = P{d \ b): 

p(d\h) 

w 

P(a, d)^Y^ P{b I a)P(a) ^i^ I c)P(c | 6) = ^ P(6 | a)P{a)P{d \ b) 

be b 

c was eliminated from this equation, and the factor of P{d \ b) was calculated. P{d \ b) is an 
intermediate factor of the algorithm. The next step will be to eliminate B. As before, this 
will be done by summing on every B = b, after excluding the terms that do not include b 
(in this case P(a)) from the addition. 

^ P{d\a) 

P{a, d)^Yl ^(^ I a)P{ajP{d \ b) = P(a) ^ P{d \ b)P{b \ a) = P{d \ a)P{a) 

b b 

b was eliminated from this equation, and the intermediate factor P{d \ a) was calculated. 
Next, in order to compute P(a, d), we only need to multiply P{a) by the intermediate factor 
P{d I a), which was already calculated. 

Generalizat ion 

The technique presented can be generalized to solve any specific distribution of a variable, 
or joint distribution of a subset of variables: Let P{xi) be the distribution we're looking for 
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in a network of n variables. First the query will be written in the following form: 

Xn X3 X2 i 

Where poj are the parents of Xi in the Bayesian network. In the last example, Yii P{xi \ pai) = 
P{d I c)P{c I b)P{b I a)P(a), After having the query in that form, perform iteratively: 

• Move all irrelevant terms outside of innermost sum. 

• Perform innermost sum, getting a new intermediate factor. 

• Insert the intermediate factor into the product 

So far we've seen how to answer queries about the distribution of a variable, or the joint 
distribution of few variables. Consider a conditional probability query, let Xi be the query 
variable given — Xk, so the distribution of P{Xi \ Xk) is requested. The query will 
be managed in a similar way as in the previous case: First P{Xi,Xk) and P{xk) will be 
calculated, as follows. Note that for P{Xi,Xk) we sum neither on Xi nor on x^- 

n 

p{x„ = E • • • E E • • • E n ^(^^ I p""^) 

Xn Xk+1 Xk-1 X2 i=l 

n 

Xn Xk+1 Xk-1 XI i=l 

P(Xi Xk) 

Next P{Xi I Xfc) is found, using the product rule (equation 8.1): P{Xi \ x^) ^ ' 



P{Xk = Xk) 



Complexity Analysis 

Let's observe the complexity of solving the queries presented. As mention in section 8.3.1, 
solving such queries in a Bayesian network is NP-hard. Therefore, in the worst case, time 
complexity is expected to be exponential. But most practical cases do not yield such net- 
works. In the following calculations the variables will be treated at first as binomial, to 
simplify the calculations. In order to find the distribution of Xi, the following expression 
needs to be calculated: P{xi) = • • • P{x2, ^3, ■ ■ ■ , Xn)- 

Xn X3 X2 

In the naive case, all the additions will be performed on the joint distribution. Since we sum 
over all the possible combination of Xi, . . . . ,t„, this will cost 0(2"'). In the multinomial case, 
where Xi has k possible values, the complexity will be 0{k"'). 

We've seen that time complexity might be reduced, if the factorial representation is 
used and each sum is done only on the relevant factors. In this method we start with the 
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expression P{xi) — • • • JJ^ P{xi \ pai), and then perform 0{n) additions, each on 

its relevant factors. Let's observe the complexity of the Xj-th summation: We sum only 
on the relevant factors of the product, let r, be the number of variables in the intermediate 
factor generated in summation of Xj. For example, summing ^^-^(^ I ' -^('^ I ^' 
generate the intermediate factor Pi^A \ b, d) with r = 3. In the intermediate factor there 
are 0(2''') possible combinations, and each one is calculated by multiplying 0(ri) factors. 
Therefore summing Xi will cost 0(rj • 2^'). Let r = max"^^{ri}. Summing 0{n) variables 
will cost 0{n • r • 2''). In the multinomial case, the complexity will be 0{n -r -k"^). Note that 
this is significantly smaller than 0{k^) only if each variable has a small number of parents in 
the network (few parents means few dependencies, which will give factors with few variables, 
hence a low r). 

Finally note that the number of factors, and hence the complexity of the elimination, 
strongly depend on the order of elimination. Even in the simple elimination example on the 
chain network described in figure (8.6), containing only the four variables A,B,C and D 
this fact is manifested, consider the following order of elimination: 

Pid) = J]J]J]P(a,6,c,d) = 5]5]5;P(d|c)P(c|6)P(6|a)P(a) 

c b a c b a 

Pib) 



5] 5] P(ci I c)P{c I b) J2 Pib I a)P{a) = P(ci | c)P{c \ b)P{b) 



c b a c b 

P{c) 

. . 



c b c 

Each one of the products being summed contains 1 factor at most: Just one variable except 
the one that we sum upon. Alternatively, consider the following order of elimination: 

= EEE^(°'^'^'^) = EEE^(^i^)^(^i^)^(^i«)^(°) 

a b c a b c 

P{d\a) 



= E E I I = E ^^'^ I = ^i^^ 

a b a 

In this elimination all the products except the last contains 2 factors, hence the complexity 
of this elimination is greater, since the order of elimination wasn't optimal. Choosing an 
efficient order of elimination can be done using a clique tree algorithm, which will not be 
discussed here. 
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8.4 Learning Bayesian Networks 

8.4.1 Introduction [5] 

The amount of available information is growing rapidly. However, knowledge acquisition is 
an expensive process. Learning allows us to construct models from raw data, which can 
then be used for many tasks. Bayesian networks in particular, which convey conditional 
independencies, enable us to capture the structure of many real- world distributions. 

8.4.2 The Learning Problem 

The problem of learning a Bayesian network can be stated as follows: 

Let m be the number of samples and n the number of variables. Given a training set 
D = (Xi, . . . ,Xm), where Xj = {xn, . . . ,Xin), and prior information, find a network that 
best matches D. The problem of learning a Bayesian network can be learning the CPD (the 
parameters) given a certain structure, as in figure 8.7, or learning both the distributions and 
the graph's structure (the dependencies), as in figure 8.8. In addition, the learning problem 
can be divided into two other cases: complete data and incomplete data. In complete data, 
all parameter values are known, whereas in the case of incomplete data, some of the values 
in the vectors = {xa, . . . ,Xin) are missing. 

The means of handling these aforementioned cases, are described in table 8.7. 





Known Structure 


Unknown Structure 


Complete Data- 
All instances of the 
variables are known 


Statistical parametric 
estimation (closed-form 

equations) 


Discrete optimization over 
structures (discrete search) 


Incomplete Data- 

Not all instances of the 
variables are known 


Parametric optimization 
(EM, gradient descent...) 


Combined (structural EM, 
mixture models...) 



Table 8.7: Means of learning the network in different cases 



We will focus on complete data for the rest of the talk. We start with the case of a known 
structure, whose parameters we wish to learn and estimate. 

8.4.3 The Likehhood Method 

One way of finding an estimator for a parameter is the likelihood method. According to this 
method, given a sequence of samples, we look for the parameter's value that seems most 
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Know n Structure, Complete Data 

4 



E, B, A 
-:Y,N,N> 
-:Y,N,Y> 
-:HHY> 
-:r{Y,Y> 







e b 
e b 
' b 







P^A 1 Fit 


e b 
e 1 
~e b 

1? 


.P .1 
.7 .3 
.B .2 
.99 .01 





Fi gUre 8.7; Learning from a known structure and complete data 



U nknow n Structure, Complete Data 





P^A IPS 


e b 
e b 
"e b 





<2) CE? 
(35 



£ e 




e b 
e "b 


.9 .1 
.7 .3 
.B .2 
.PP .01 



Fi gUre 8.8; Learning from an unknown structure and complete data 

"likely" to have caused this certain sequence of samples (i.e. gives our sample the highest 
probability compared to all other possible values of the parameter). In other words, we try 
to deduce the value of the parameter after a certain sample has been received, namely the 
value of the parameter that best explains the given sample results. 

Likelihood Definitions 

• The Likelihood Function [7, 6] 

Let Xi, X2, . . ■ , Xm be samples from a population with a certain distribution with 
an unknown parameter 6 and let Xj = x[i] be the training data- the sample results 
obtained, each result independent of the others. The likelihood function is defined as: 



L{9 : D) = P{D 



P(Xi = x[l],X2 = x[2],...,X„ = x[^ 



x\m\ 



1=1 



X\l\ 



where 9 gets all the possible values of the parameter. 



Maximum Likelihood Estimator 

Given a sample result, 9 is the maximum likelihood estimator (MLE) for the parameter 
9 if it holds that \I9.L{9) > L{9). In other words, 9 is the value of the parameter 9 
which maximizes the likelihood function. 
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8.4.4 Example: Binomial Experiment 

A coin can land on one of two positions: H or T, each with an unknown probabihty. We 
denote by 6 the unknown probabihty P{H). 

Estimation Task: given a sequence of samples x[l], a;[2], . . . , x[m] we want to estimate the 
probabihty P{H) = 9 and P(T) = 1-9. 

Suppose we performed six independent Bernolh experiments and received the following sam- 
ple results: H, H, H, H, T, T. We denote by Xi the result of the i-th coin toss. The likelihood 
function for the obtained results is: 

L{9 : D) = P{D I 9) = P{Xi = H,X2 = H,Xs = H,Xi = H^X^ = T,Xq = T \ 9) = 

= Pe(X, = H)Pe(X2 = H)Pe(Xs = H)Pe(X4 = H)Pe(X5 = T)Pe{Xe = T) = 

= 9\1 - 9f 

where (*) is because of independent coin tosses. 
Note the following: 

• For a known 6*, we would have an exact numerical value denoting the probability to 
get the training data results. 

• The coefficient (^) in the binomial distribution is omitted here, since we have the 
exact sample sequence and for each result in the sequence we are interested only in the 
probability of obtaining that certain result. 

In addition, even if we write the coefficient, it will eventually "disappear" from the 
equation, as explained in the steps to come. 

We can formulate the following general case: 
Let Nh be the number of occurrences of H and Nt be the number of occurrences of T in the 
training set, so that Nh + Nt = m. 

L{9 : D) = P{D I 9) = 9h^" ■ 9t^^ 

where 9^ is the probability of the outcome H and 9^ is the probability of the outcome T. 

As described earlier, the MLE principle is to choose the value of the parameter that 
maximizes the likelihood function. Applying the MLE principle means we have to find the 
maximum point of the likelihood function. In order to do this, we will first switch to the log 
(or In) of the likelihood function. This is done in order to make the derivative calculations 
easier (instead of a complicated derivative of a long product term the log turns the product 
into summation, thus making the calculation of the derivative simpler). The logarithm is a 
monotonically increasing function and therefore the maximum point of the function L{9) is 
the same as the maximum point of the function logL(^). 
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Returning to finding the maximum point of our likeliliood function, we get: 

InL(^) = 41n^ + 21n(l -e) 
d\nL{e) _ 4 2 

de ~e~ 1-e 



e = 2/3 

or in the general case: 

Nh + Nt 

We can now see that had we written the coefficient, it would have been omitted anyway, 
after applying the log and finding the derivative. 

Using the above general equation, we can easily find the MLE estimation for other cases. 
For example, given {Nh, Nt) = (3, 2), the MLE estimation is | = 0.6. 




MLE 

Figure 8.9: The maximum likelihood estimation 

8.4.5 Learning Parameters for Bayesian Networks 

When learning parameters for Bayesian networks, the training data has a form as in figure 
8.10. 
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D-- 



11 



V ml 



{ E\\\ ^[1] 



E\Yn\ ^[m] C[m] 



Figure 8.10: The training data, including the network structure 

Since we assume i.i.d (independent identically distributed) samples, the likelihood 
function is: 



Lie ■ D) = Y[P{E[m],B[m],A[m],C[m] : 9) = JJ 



/ P{E[m] -.e)- \ 
P{B[m] : 9) ■ 

P{A[m] I B[m],E[m] : 9) ■ 
\ P\c[m] I A[m] : 9) ) 



the factorial representation 

where (*) is due to switching to the factorial representation, by definition of the network in 
figure 8.10. Rewriting the terms (looking at the columns of the training data instead of the 
rows) we get: 

L{9 ■ D) = Y[P{E[m] : 9)Y[P{B[m] : 9)Y[P{A[m]\B[m],C[m] : ^) JJ P(C[m] |A[m] : 9) 

m m m m 

As can be seen from the calculation above, we have gone from the product of the probabilities 
of all nodes for each m, to the product of the likelihood of each node (given its parents, if it 
has any). 

Generalizing for any Bayesian network we get: 

L{9:D) = l[ P(xi[m], . . . , x„[m] : ^) ^ 

network factorization 



JJ JJP(x4m] I Pa,[m] : 9^ 

i m 

\[L,{9r.D) 



where Poj is the set of node Xj's parents. 

We can see here the decomposition principle: the likelihood decomposes according to the 
structure of the network meaning that we can calculate each node's log-likelihood separately 
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and sum them all to get the log-likelihood of the entire network. The decomposition enables 
us to have independent estimation problems. In other words, since the parameters for each 
variable are not dependent, they can be estimated independently of each other. As a result, 
we can use the MLE formula we already described and get 6xi\pai = — — - — where 
pai is a certain combination of the parents (for example, see figure 8.11). 




Figure 8.11: Demonstrating the decomposabihty principle- the variable Xi has three parents, which are the members of 
the set Pai. Each row of the variable's CPD is a different combination of the parents' values and the columns are the different 
values Xi can receive. Each row is Xi's distribution given the parents' specific combination (the probabilities in each row 
sum up to 1) and the parameter is estimated according to the formula for 8. In this figure, we wish to estimate the value of 
8 = P(xi = 0) for a given parental combination 010. 



8.4.6 Likelihood Function: Multinomials 

Up to now we dealt with binomial variables that had either the value or the value 1 
("failure" or "success"). We now make the transition to the multinomial case. 
Suppose our variable X can have the values 1, 2, . . . , A;. We would like to learn the parameters 
Oi., ... .,9k (for example each 9i corresponds to the probability of getting a specific number 
on a die). We denote hy Ni, . . . , the number of times each outcome is observed. We get 
the following likelihood function: 

k 

L(^^:D) = n^/^ 

i=i 

where 9j is the probability of the outcome j and Nj is the number of times the outcome j is 
observed in D. 

As in the binomial case, we have once again omitted the coefficient (this time jv-^ijv^T' n^'. ^^ ~ 
A'^i + . . . + Nk) for it "falls off" after the log likelihood function is derived. 
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In the same way as before, we get that the MLE is 

In Bayesian networks, when we assume that P{xi \ pai) is multinomial, we get further 
decomposition. From equation 8.11 we get: 

L{ei : D) = l[P{xi[m] \ Pai[m] : 9^) 

m 

We now write the above product expression in a slightly different way, by grouping products 
according to a certain combination of the parents' values (for example, in the network de- 
picted in figure 8.10, a combination of A's parents E and B may be = 1 and B = 2). We 
once again denote by pai each combination of the parents and go over all such combinations. 
Now, for each poj we multiply only observations where Pai[m] = pai. Thus, 

^ n n P{xi[m] I pai : Oi) 

pai m:Pai[m]=pai 

Looking at the second product expression, we now "group" it even further- for each combi- 
nation of the parents, we take a certain value of Xi and raise its probability to the power of 
that certain value's number of appearances: 

= nn^(^^ I pa^ : ^,)^^""""'^ = nn^->ip»^''^'"'"^^ 

pai Xi pai Xi 

For each combination pai of the parents of Xi we get an independent likelihood function 
Ylx ^x.ipai^^^"^""'^ ■ After performing log on the last expression and finding its derivative we 
get'that the MLE is: 

^ _ N{xi,pai) 
^Xilpai- ^^^^^^.^ 

This is essentially as in figure 8.11, but for the multinomial case. 
8.4.7 Bayesian Learning 

Unlike the MLE approach, the Bayesian approach doesn't estimate the most likely 9, but 
rather takes into account all possible ^^s with their probabilities. We can represent our 
uncertainty about the sampling process using a Bayesian network, as seen in figure 8.12. 
All the data are dependent on the certain 9 that "created" them. As seen in the Markov 
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Observed data 



Figure 8.12: Representing Bayesian prediction as inference in Bayesian network 

assumption, given 6 (the parent) the values of the different x[i] are independent. 
The Bayesian prediction is inference in this network: 

P{x[m + 1] I x[l],...,x[m]) = [ P{x[m + 1] \ 9,x[l], . . . ,x[m])P{9 \ x[l], . . . ,x[m])d9 







where (*) is because of the conditional probability and total probability theorem for a con- 
tinuous variable ^ (0 < 6' < 1). Since the first probability term in the integral is for a given 
9, x[m + 1] is independent of its non-descendants and we can therefore write: 



[ P{x[m + l]\9)P{9 \x[l],...,x[m])d9 (8.11) 
Jo 



Now let us look at the second term of the integral. 



likelihood prior 



, r n r ^^ P (x\l] , . . . , x\m] I 9)P(9) 

P{9 x[l],...,x[m]) ^ rii n 8-^2 

• P[x[l\, . . . ,x[ni\) 



posterior ^ayes rule 



probability of data 



The prior is some earlier information we have regarding the random variable 9. 
Using the total probability theorem, we may rewrite the denominator and get: 

P{x[l],...,x[m]) = ! P{x[l],...,x[m]\9)P{9)d9 
Jo 

and therefore the expression on the right hand side of equation 8.12 can be written as 

P(x[l],...,x[m] I 9)P{9) 
/o P(a;[l],...,x[m] | 9)P{9)d9 

Note that P(x[l], . . . , x[m]) does not depend on 9 (it goes over all values of 9 in its equivalent 
integral expression). As a result, it behaves as a constant in formula 8.11 and can be taken 
out of the integral there (see the example in the following section, 8.4.8). 
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8.4.8 Binomial Example: MLE vs. Bayesian Learning 

We return to our coin toss example. Recall that P{H) — 9. This time, however, we also have 
prior information that the distribution of 6 is uniform in [0, 1]. Therefore, P{6) = f{9) — 1 
(no substantial information about 9). In addition, the data is {Nh,Nt) — (4, 1) (there are 
four heads and one tail). 

According to the MLE general case found earlier, the MLE for P{X — H) is j^^^j^^ = | = 
0.8. 

The Bayesian prediction is: 

P{x[m + 1] = H \ D) = [ P{x[m+1] = H \e)P{e\x[l],...,x[m\)de^ 

Jo 

Jo ^ ^ ^ ' ^ P{x[l],...,x[m]) 

^ P{x[m + 1] = H I e)P{e)P{x[l],...,x[m] \ 9)de _ 

p{e)p{x[ii...,x[m]\e)de 
jli-e^H{i-e)NTde ~ j^e^-{i-e)de 

= — = 1 f = 0.7142 

J,' {e^ - e^)de \-\ 7 

Since our earlier knowledge was 0's uniform distribution, we were not "tempted" to believe, 
after only 5 tosses, that P{H) is as extreme as 0.8, but rather slightly lower, toward the 
uniform 0.5. We can see that the prior has given us a more "balanced" result. It seems as 
though we have added two more tosses, one of which was H, namely the prior has given us 
a more established result, based on our earlier knowledge. 

The differences between the MLE approach and the Bayesian approach are summarized 
in table 8.8. 



8.4.9 Dirichlet Priors 

In example 8.4.8, we saw an example of a uniform prior in [0, 1], and thus P{9) = f{6) = 1. 
Let us now look at the following density function (recall that ^ is a random variable), called 
Dirichlet's function: 

(E,tl«j-1)! A a-l 

^^^^ ^ (ai-l)!(a2-l)!...(a,-l)! ' 
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MLE approach 


Bayesian approach 


Assume that there is an unknown 


Represents uncertainty about the 

UllKllUWll pcllcLllit;Lt:l U. 


Estimate 9 with some confidence. 


Uses probability to quantify this 
uncertainty. Unknown parameters as 

T* Q T1 ^1 f^YY~k T TQ T*l Q r~l 1 OO 

1 aiiQuiii vdjiiauico. 


Prediction using the estimated parameter 
value. 


Prediction foUows from the rules of 
probability- expectation over the 
unknown parameters. 



Table 8.8: The differences between tlie MLE and Bayesian approax:lies 



This is the density function of a k-dimensional random variable 6 with hyperparameters 
CKi, . . . , CKfe- If we know these hyperparameters, we will know the density function explicitly 
as a function of 9i, ... ,9k- Note that we can "drop" the coefficient (it is a normalization 
constant) and get: 

k 

P{9)^l[9p-' 

We now use the Dirichlet distribution as a prior. Recall that the hkehhood function for 
multinomial data is: 



k 

=1 



L{9:D)^P{D\9)^l[9j 
Prom equation 8.12 we get: 

k k k 

P{9 I D) cx P{9)P{D \9)o^Y[ 9^-^ = (8.13) 

i=i j=i i=\ 

We can see that the posterior density function P{9 \ D) has the same form as the prior P{9), 
with hyperparameters aj + Nj — l, . . . ,ak + Nk — l. Therefore we can conclude that Dirichlet 
is closed under multinomial sampling (i.e. both the prior and posterior distributions are 

Dirichlet). Note that in the transition marked (*) we disregarded the denominator P{D) from 
equation 8.12. This denominator and the normalization constant, which was "dropped" from 
Dirichlet 's function, give us the normalization constant of our new Dirichlet distribution. 
Since the posterior is Dirichlet, we get: 



P{x[m + 1] = j I D) ^= j 9j ■ P{9 I D)d9 ^- 



where (*) is from equation 8.11 and (**) is the result of solving the integral with P{9 \ D), 
which appears in equation 8.13. We can learn from the above expressions that a indicates 
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the "degree of influence" tlie prior has on the posterior result. The larger a is, the greater the 
effect the prior has, because it means we have simulated a situation with more experiments 
(for each j we have aj + Nj experiments instead of the A'^ we had in the likelihood function) 
and our "added" experiments aj have a greater part out of the whole aj + Nj experiments. 
We can now generalize: 

P{Xi[m + 1 = J D) = . , = — ^TTT ^ 

2^i[a[Xi = l,pai) + N{xi = l,pai)) a{pai) + N{pai) 



Learning Multinomial Pcirameters: Sumniciry 

We count N{xi,pai) and get the following estimations: 

^ _ N{xi,pai) ^ _ a{xi,pai) + N{xi,pai 
N{pa,) ^^^1^"^" a{pa,) + N{pa,) 



MLE Bayesian (Dirichlet) 

Both can be implemented in an on-line manner by accumulating counts. 



8.4.10 Learning Structure 

As described in section 8.4.2 (The Learning Problem), given a training set D, our goal is 
to find a network that best matches D. This time, however, the structure is unknown and 
we wish to learn it on top of learning the parameters. We therefore describe the following 
optimization problem: 
Given an input of: 

• Training data 

• A scoring function (including priors) to measure our suggested network's match with 
the data 

• Set of possible structures- a set of networks (DAGs, trees, etc) from which we need to 
find the best structure 

we need to output the network (or networks) that maximize the score. 

As before, we would like to take advantage of the dccomposability property, and therefore 
we look for a scoring function such that the score of the network will be the sum of the 
nodes' scores. An example for such a score is the BDE score: 

P{G I D) oc P{D I G)P{G) ''^ P{G) I P{D \ G, 9)P{9 \ G)d9 
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where (*) holds due to the total probability theorem for P{D \ G). Following particular 
assumptions, the score satisfies the decomposability property: 



The problem of finding the optimal structure is NP-hard and we therefore address the 
problem by using heuristic searching. We traverse the space of possible networks, looking 
for high-scoring structures. This can be done by using searching techniques, such as: 

• Simulated Annealing 

• Greedy hill-climbing- we start with a certain structure and we add modifications until 
we reach a local maximum. In this technique, we save time by not going over all the 
possibihties. 

The typical operations performed in our heuristic search can be seen in figure 8.13- we 
can add, remove or reverse the direction of an edge, thus creating a new structure, which 
has a new score we can calculate. Traversing all possible edges and applying the above 
modifications, we can find the highest-scoring new structure and reiterate the process from 
it. The number of edge modifications in each step of the greedy hill-climbing algorithm is 
O(n^), where n is the number of nodes (because there are O(n^) edges). Once again, the 
importance of the decomposability property is demonstrated- each modification of a certain 
edge is local to the node that particular edge enters. The decomposability enables us to 
update the score of either only one node (in case of an addition or removal of an edge) or 
two nodes (in case of a reverse in an edge's direction), without having to recalculate the 
entire network's score. 



Score{G : D) = J2 Score{xi \ Paf : D) 




Figure 8.13: The typical operations performed in the heuristic search 



Applications Example: DNA Binding Sites Motifs 
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8.5 Applications Example: DNA Binding Sites Motifs 

8.5.1 Introduction 

Our goal in this example is to analyze and predict DNA binding sites. 

Representing Sites: Given a collection of known binding sites, which bind a certain tran- 
scription factor, we wish to develop a representation of those sites that can be used to search 
new sequences and reliably predict where additional binding sites occur. So far in the course 
we have already studied four methods to achieve this: 

• Consensus sequence- the sequence that represents the collection of binding sites. For 
more details refer to the multiple alignment scribe, sections 4.2.2 and 4.2.3. 

• Profile- profiles of binding site motifs are frequently called PSSM (Position Site 
Specific Matrix). For more details refer to the multiple ahgnment scribe, section 4.4.1. 

• Profile HMM- generalizes the profile method. For more details refer to the Hidden 
Markov Models scribe, section 5.2 (Baum- Welch training). 

• Bayesian Networks (modeling independencies)- this method is unique in modeling 
dependencies between nodes which are not adjacent to each other. 

8.5.2 Representing Sites example 

In order to build any model, we need a Training set. In our example the training set is a set 
of sequences which are known to be binding sites, as specified in table 8.9. 

Training set 
AGACT 
AGTCT 
ACAGT 

ACTGT 
AGTCT 

Table 8.9: Training set of DNA binding sites 

If we have more data than what is used for the training set, that data enables us to 
estimate the quality of our model. In this example, we know that the genome sequences in 
table 8.10 are binding sites, and the sequences in table 8.11 are known non-binding sites. 

In the following definitions True and False refer to whether in reality the data is, or is not, 
a binding site, while Positive and Negative refer to the prediction of the model: whether 
the model predicts the data to be, or not to be, a binding site. 
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Additional binding sites 

1) AGACT 

2) AGTCT 

3) ACAGT 

4) ACTGT 

Table 8.10: Genome sequences, known as binding sites 



Non-Binding sites 

5) ACTCT 

6) AGTGT 

7) ACACT 

8) AGAGT 

Table 8.11: Genome sequences known as non-binding sites. 



Definition Sensitivity - The ratio of the true positives out of all the true data. Meaning, 
how much of all the data that is really in the set, was predicted as such by the model. 

Definition Specificity - The ratio of the true positives out of all the positives. Meaning, 
how much of the data which was predicted by the model to be in the set, is really in it. 

A good model is one that will show high ratios in both indexes. 
Consensus Representation 

The consensus of the training set (table 8.9) is AGTCT. Searching in the known genome 
sequences (tables 8.10 and 8.11) allowing no mismatches will give: 

• 1 true positive (sequence 2, which is identical to the consensus). 

• 3 true negatives (1, 3, 4). 

• No false positives. 

This is perfect specificity (1), but low sensitivity (|). On the other hand, searching with the 
consensus allowing one mismatch will give us the following: 

• 2 true positive (1, 2). 

• 2 true negatives (3, 4). 
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• 2 false positives (5, 6). 
Which gives us a specificity of | and a sensitivity of |. 

PSSM 

The PSSM matrix contains the score of each nucleotide in each position, which is derived 
from their frequency in each location in the training set. A PSSM matrix for the training 
set is described in table 8.12. Using the PSSM for a high probability search will yield: 





1 


2 


3 


4 


5 


A 


1 





2/5 








G 





3/5 





2/5 





C 





2/5 





3/5 





T 








3/5 





1 



Table 8.12: PSSM matrix of training set in table 8.9 



• 4 true positive (all of the known binding sites). 

• No true negative. 

• 4 false positives (all of the known non-binding sites). 
This is a specificity of |, but a perfect sensitivity (1). 

Bayesian Networks 

Note that in the sequences of the training set if position 2 is G then position 4 is C, and 
vice versa: if position 2 is C, then position 4 is G. The reason for the PSSM's low specificity 
stems from the fact that it cannot represent such dependencies, and so it accepts sequences 
which have cither C or G in both positions. Bayesian networks are able to express such 
dependencies, and therefore will be ideal for this example. The Bayesian network for this 
example's training set is presented in figure 8.14. 

Running an high probability search on the Bayesian network model will give us: 

• 4 true positive (all of the known binding sites). 

• No true negative. 

• No false positives. 
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Figure 8.14: Bayesian network produced using the training set on table 8.9. This model is achieved if no priors are used 
in the learning process (see section 8.4.2) 

This is a perfect specificity (1) and perfect sensitivity (1) ! 

Note that the Bayesian method gave such high results, since in this hypothetical example, 
dependencies existed between the nucleotides of the binding sites. It is not certain that such 
dependencies exist in nature as well. The next section will present an attempt to find out if 
there are dependencies of this kind in nature. 

8.5.3 Modeling Dependencies in Protein-DNA Binding Sites 

This section describes the application of Bayesian network by Barash et al. (2003) [1] 
for modeling dependencies in protein-DNA binding sites. From the TRANSFAC database 
[3] they acquired a set of 95 transcription factors, for each at least 20 short binding sites 
were known. For each transcription factor two models were created, one applying PSSM 
profile, and one applying Bayesian networks (in order to simplify the process the Bayesian 
network was restricted to a form of tree). Having the models, the goal is to estimate the 
generalization of each models created. This is not as simple as demonstrated in section 8.5.2, 
since no additional binding sites are available. In such cases, where only few sequences are 
available and are all used for the training set, we can exclude from the training set a small 
subset of sequences, use the rest of the sequences to create the model, and then test each 
of the excluded sequences for its probability given the model. Repeating that process for 
each small subset will give us an indication about the quality of the model. This technique 
is called Cross Validation. Barash et al. (2003) [1] used cross validation in the following 
manner: each one of the binding sites was excluded from the training set, a model was 
created and then the log likelihood to receive the excluded sequence given the model was 
calculated. Afterward the average log-likelihood of all the sequences was calculated, and 
used to evaluate the quality of the model. 

One of the transcription factors that were examined by Barash et al. (2003) was Ara- 
bidopsis ABA binding factor 1. 49 examples were used for its modeling using profile and 
Bayesian networks models. Figure 8.15 is a graphical representation of the profile created. Its 
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cross validation test average log-likelihood was -19.93. The Bayesian network model (figure 
8.16) received a higher average log-likelihood of -18.47. Notice the Bayesian network shows 
high dependency between positions 5 and 6, this dependency couldn't be expressed in the 
profile model, and is probably partially responsible for the higher likelihood in the Bayesian 
model. In order to further check the quality of Bayesian networks vs. profiles, 105 yeast 




Fig lire 8.15i Profile of Arabidopsis ABA binding factor 1, based on 49 examples 




Fig Ure 8.16) Bayesian network for Arabidopsis ABA binding factor 1, based on 49 examples 

transcription factors data sets from Lee et al. (2002) [2] were used. The data set of each 
transcription factor included the strength of binding to each genome promoter. Using this 
information a profile and a Bayesian network were created for each transcription factor, and 
their sensitivity and specificity were calculated. Figure 8.17 presents the difference between 
the Bayesian network (tree) and the profile in specificity and in sensitivity. Note that in 
most cases in which there was a difference between the two models the Bayesian network 
gave better results. These findings imply that in the transcription factor's binding sites there 
are indeed certain dependencies between the nucleotides in the different locations. 
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Figure 8.17: Difference between Bayesian tree and profile of yeast binding sites. In 30 out of 54 sites, there is an 
improvement both in sensitivity and in specificity by using the Bayesian method instead of the profile. 
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